59 research outputs found
HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales
Current captioning datasets focus on object-centric captions, describing the
visible objects in the image, e.g. "people eating food in a park". Although
these datasets are useful to evaluate the ability of Vision & Language models
to recognize and describe visual content, they do not support controlled
experiments involving model testing or fine-tuning, with more high-level
captions, which humans find easy and natural to produce. For example, people
often describe images based on the type of scene they depict ('people at a
holiday resort') and the actions they perform ('people having a picnic'). Such
descriptions draw on personal experience and commonsense assumptions. We
present the High-Level Dataset, a dataset extending 14,997 images from the COCO
dataset, aligned with a new set of 134,973 human-annotated (high-level)
captions collected along three axes: scenes, actions, and rationales. We
further extend this dataset with confidence scores collected from an
independent set of readers, as well as a set of narrative captions generated
synthetically, by combining each of the three axes. We describe this dataset
and analyse it extensively. We also present baseline results for the High-Level
Captioning task.
Invisible to People but not to Machines: Evaluation of Style-aware Headline Generation in Absence of Reliable Human Judgment
We automatically generate headlines that are expected to comply with the specific styles of two different Italian newspapers. Through a data alignment strategy and different training/testing settings, we aim at decoupling content from style and preserving the latter in generation. To evaluate the generated headlines' quality in terms of their newspaper-specific compliance, we devise a fine-grained evaluation strategy based on automatic classification. We observe that our models do indeed learn newspaper-specific style. Importantly, we also observe that humans are not reliable judges for this task: although familiar with the newspapers, they are not able to discern their specific styles even in the original human-written headlines. The utility of automatic evaluation therefore goes beyond saving the costs and hurdles of manual annotation, and deserves particular care in its design.
GePpeTto Carves Italian into a Language Model
In the last few years, pre-trained neural architectures have provided
impressive improvements across several NLP tasks. Still, generative language
models are available mainly for English. We develop GePpeTto, the first
generative language model for Italian, built using the GPT-2 architecture. We
provide a thorough analysis of GePpeTto's quality by means of both an automatic
and a human-based evaluation. The automatic assessment consists of (i)
calculating perplexity across different genres and (ii) a profiling analysis
over GePpeTto's writing characteristics. We find that GePpeTto's production is
a sort of bonsai version of human production, with shorter yet complex
sentences. Human evaluation is performed over a sentence completion task, where
GePpeTto's output is judged as natural more often than not, and much closer to
the original human texts than to a simpler language model which we take as
baseline.
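The abstract above mentions calculating perplexity across genres as part of the automatic assessment. As a minimal sketch of what that metric is, here is perplexity computed from a model's per-token probabilities; the probability values below are toy numbers for illustration, not GePpeTto's actual outputs:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token.
    Lower values mean the model found the text more predictable."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Toy per-token probabilities a model might assign to a short sentence.
probs = [0.5, 0.25, 0.125, 0.5]
print(round(perplexity(probs), 3))  # 3.364 for these probabilities
```

In practice one would obtain the per-token probabilities from the language model itself and average the negative log-likelihood over a whole genre-specific corpus, but the formula is the same.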
Interpreting vision and language generative models with semantic visual priors
When applied to image-to-text models, explainability methods face two challenges. First, they often provide token-by-token explanations, i.e. they compute a visual explanation for each token of the generated sequence. This makes explanations expensive to compute and unable to comprehensively explain the model's output. Second, for models with visual inputs, explainability methods such as SHAP typically consider superpixels as features. Since superpixels do not correspond to semantically meaningful regions of an image, this makes explanations harder to interpret. We develop a framework based on SHAP that allows for generating comprehensive, meaningful explanations by leveraging the meaning representation of the output sequence as a whole. Moreover, by exploiting semantic priors in the visual backbone, we extract an arbitrary number of features that allow the efficient computation of Shapley values on large-scale models, generating at the same time highly meaningful visual explanations. We demonstrate that our method generates semantically more expressive explanations than traditional methods at a lower compute cost, and that it generalizes to a large family of vision-language models.
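The abstract above centers on computing Shapley values over a small number of semantically meaningful features rather than many superpixels. As a minimal sketch of the underlying idea, here is the exact Shapley computation for a handful of named image regions; the region names, their toy scores, and the additive scoring function are illustrative assumptions, not the paper's actual model:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values: each feature's marginal contribution,
    averaged over all coalitions with the standard Shapley weights.
    Cost grows exponentially in len(features), which is why keeping
    the feature count small matters."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value_fn(set(subset) | {f}) - value_fn(set(subset)))
        phi[f] = total
    return phi

# Toy additive "model score": each semantic region contributes a fixed amount
# (hypothetical regions and values, for illustration only).
contrib = {"sky": 0.1, "dog": 0.7, "grass": 0.2}
score = lambda coalition: sum(contrib[r] for r in coalition)
phi = shapley_values(list(contrib), score)
# For an additive game, the Shapley values recover each region's contribution.
```

With semantically chosen regions as features, each Shapley value can be read directly as "how much this region mattered", which is the interpretability gain the abstract describes over superpixel features.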
- …